Vehicle insurance fraud is a significant problem that involves false or exaggerated claims following an accident. Fraudsters may stage accidents, fabricate injuries, or engage in other deceptive practices to make claims. To address this issue, a kaggle dataset (https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection) which includes information on vehicle attributes, accident details, and policy information has been used. The primary objective of this project is to develop a machine learning model that can assist insurance companies in identifying fraudulent claims.
In this Jupyter notebook, I conducted an extensive analysis of the data, including exploring its characteristics, splitting it into training, validation, and testing sets using a stratified approach, and pre-processing it for machine learning. I employed various techniques such as encoding categorical features and scaling numerical features to ensure optimal performance of the models. Then, I trained and fine-tuned different algorithms, including logistic regression, random forest, and XGBoost, using random search CV with 5-fold cross-validation. After comparing their performance, the best ML model was selected based on y precision, recall, and F1 score. Finally, I used the ANOVA test to explore the feature importances of the best model on the test set, providing insights into the important predictors for fraud detection.
# data manipulation
import pandas as pd
# mathematical functions
import numpy as np
from scipy.stats import randint, uniform
import random
# data visualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as pyo
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "notebook_connected"
# data splitting
from sklearn.model_selection import train_test_split
# data preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
import category_encoders as ce
from category_encoders import BinaryEncoder
from sklearn.feature_selection import SelectKBest, f_classif
# algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
# model training requirements
import warnings
from sklearn.model_selection import RandomizedSearchCV, KFold
# model evaluation
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
This dataset contains Time-related features, Policy and vehicle-related features, and Accident-related features to detect fraudulent claims.
# gathering dataset for building prediction model
df = pd.read_csv("C:/Users/aswinram/Aswin's Data Science Portfolio/Vehicle Insurance Fraud Detection/data/fraud_oracle.csv")
# remove spaces in columns name
df.columns = df.columns.str.replace(' ','_')
# print shape of df
print("The shape of df:", df.shape)
df.head()
The shape of df: (15420, 33)
| Month | WeekOfMonth | DayOfWeek | Make | AccidentArea | DayOfWeekClaimed | MonthClaimed | WeekOfMonthClaimed | Sex | MaritalStatus | ... | AgeOfVehicle | AgeOfPolicyHolder | PoliceReportFiled | WitnessPresent | AgentType | NumberOfSuppliments | AddressChange_Claim | NumberOfCars | Year | BasePolicy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dec | 5 | Wednesday | Honda | Urban | Tuesday | Jan | 1 | Female | Single | ... | 3 years | 26 to 30 | No | No | External | none | 1 year | 3 to 4 | 1994 | Liability |
| 1 | Jan | 3 | Wednesday | Honda | Urban | Monday | Jan | 4 | Male | Single | ... | 6 years | 31 to 35 | Yes | No | External | none | no change | 1 vehicle | 1994 | Collision |
| 2 | Oct | 5 | Friday | Honda | Urban | Thursday | Nov | 2 | Male | Married | ... | 7 years | 41 to 50 | No | No | External | none | no change | 1 vehicle | 1994 | Collision |
| 3 | Jun | 2 | Saturday | Toyota | Rural | Friday | Jul | 1 | Male | Married | ... | more than 7 | 51 to 65 | Yes | No | External | more than 5 | no change | 1 vehicle | 1994 | Liability |
| 4 | Jan | 5 | Monday | Honda | Urban | Tuesday | Feb | 2 | Female | Single | ... | 5 years | 31 to 35 | No | No | External | none | no change | 1 vehicle | 1994 | Collision |
5 rows × 33 columns
This sub-section aims to improve the feature recognition process. It involves:
Identifying the target feature(s) - The feature that the model aims to predict.
Grouping input features - The input features are grouped into different categories based on their data type or characteristics, including:
All features: inlcludes all the input features
Numeric features: features that represent numerical values such as age, price, days, etc.
Categorical features: features that represent discrete values such as make, policy type, marital status, etc.
Binary features: features that represent only two possible values such as police report filed, witness present, etc.
Ordinal features: categorical features that have a natural order such as driver rating.
Nominal features: categorical features that have no natural order such as make, agent type, etc.
High cardinality features: categorical features that have a large number of unique values such as policy number.
By grouping the features into different categories, it can help to identify which features may require additional preprocessing or encoding to be used effectively in a model. It can also help to guide the feature engineering process feature selection process.
# -----TARGET SELECTION-----
# Output Feature
target_feature = 'FraudFound_P'
print("Target Feature: \n", target_feature)
print()
# -----INPUT FEATURE RECOGNITION-----
# -----all features-----
all_features = df.columns.to_list()
all_features.remove(target_feature)
print('All Features: \n', all_features)
print()
# -----numeric features-----
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'object' and df[feature].dtype !='datetime64[ns]']
numeric_features.remove(target_feature)
print('Numeric Features: \n', numeric_features)
print()
# -----categorical features-----
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'object']
print('Categorical Features: \n', categorical_features)
print()
# -----binary features-----
binary_features = ['AccidentArea', 'Sex', 'Fault', 'PoliceReportFiled', 'WitnessPresent', 'AgentType']
print('Binary Features: \n', binary_features)
print()
# -----ordinal features-----
ordinal_features = ['VehiclePrice', 'Days_Policy_Accident', 'Days_Policy_Claim', 'PastNumberOfClaims',
'AgeOfVehicle', 'AgeOfPolicyHolder', 'NumberOfSuppliments', 'AddressChange_Claim', 'NumberOfCars']
print('Ordinal Features: \n', ordinal_features)
print()
# -----nominal features-----
nominal_features = ['Month', 'DayOfWeek', 'Make', 'DayOfWeekClaimed', 'MonthClaimed', 'MaritalStatus', 'PolicyType',
'VehicleCategory', 'BasePolicy']
print('Nominal Features: ', nominal_features)
print()
# -----high cardinality features-----
# Set the threshold for high cardinality
threshold = 7
# Calculate the number of unique values in each column
cardinality = df.nunique()
# Select the columns where the number of unique values is greater than the threshold
high_cardinality_features = cardinality[cardinality > threshold].index.tolist()
print('High cardinality features: ', high_cardinality_features)
print()
Target Feature: FraudFound_P All Features: ['Month', 'WeekOfMonth', 'DayOfWeek', 'Make', 'AccidentArea', 'DayOfWeekClaimed', 'MonthClaimed', 'WeekOfMonthClaimed', 'Sex', 'MaritalStatus', 'Age', 'Fault', 'PolicyType', 'VehicleCategory', 'VehiclePrice', 'PolicyNumber', 'RepNumber', 'Deductible', 'DriverRating', 'Days_Policy_Accident', 'Days_Policy_Claim', 'PastNumberOfClaims', 'AgeOfVehicle', 'AgeOfPolicyHolder', 'PoliceReportFiled', 'WitnessPresent', 'AgentType', 'NumberOfSuppliments', 'AddressChange_Claim', 'NumberOfCars', 'Year', 'BasePolicy'] Numeric Features: ['WeekOfMonth', 'WeekOfMonthClaimed', 'Age', 'PolicyNumber', 'RepNumber', 'Deductible', 'DriverRating', 'Year'] Categorical Features: ['Month', 'DayOfWeek', 'Make', 'AccidentArea', 'DayOfWeekClaimed', 'MonthClaimed', 'Sex', 'MaritalStatus', 'Fault', 'PolicyType', 'VehicleCategory', 'VehiclePrice', 'Days_Policy_Accident', 'Days_Policy_Claim', 'PastNumberOfClaims', 'AgeOfVehicle', 'AgeOfPolicyHolder', 'PoliceReportFiled', 'WitnessPresent', 'AgentType', 'NumberOfSuppliments', 'AddressChange_Claim', 'NumberOfCars', 'BasePolicy'] Binary Features: ['AccidentArea', 'Sex', 'Fault', 'PoliceReportFiled', 'WitnessPresent', 'AgentType'] Ordinal Features: ['VehiclePrice', 'Days_Policy_Accident', 'Days_Policy_Claim', 'PastNumberOfClaims', 'AgeOfVehicle', 'AgeOfPolicyHolder', 'NumberOfSuppliments', 'AddressChange_Claim', 'NumberOfCars'] Nominal Features: ['Month', 'DayOfWeek', 'Make', 'DayOfWeekClaimed', 'MonthClaimed', 'MaritalStatus', 'PolicyType', 'VehicleCategory', 'BasePolicy'] High cardinality features: ['Month', 'Make', 'DayOfWeekClaimed', 'MonthClaimed', 'Age', 'PolicyType', 'PolicyNumber', 'RepNumber', 'AgeOfVehicle', 'AgeOfPolicyHolder']
Exploratory Data Analysis (EDA) plays a crucial role in enhancing the performance of machine learning models. It helps in identifying errors, detecting patterns, selecting relevant features, improving model accuracy, and effectively communicating insights. In this section, histograms are created for both numerical and categorical features along with the colored target feature to determine the presence of fraudulent activities across different input features. This step provides essential insights into the data and assists in making informed decisions for the subsequent stages of the model development process.
# Create histograms for numeric features
for col in numeric_features:
fig = px.histogram(df, x=col, nbins=20, color=target_feature, barmode="overlay")
fig.show()
# Create bar plots for categorical features
for col in categorical_features:
fig = px.histogram(df, x=col, color=target_feature)
fig.update_layout(barmode="overlay")
fig.show()
# Check if there are any missing values in the DataFrame
if df.isnull().any().any():
print('There are missing values in the DataFrame')
else:
print('There are no missing values in the DataFrame')
There are no missing values in the DataFrame
Data splitting into train, validation, and test sets is important for machine learning to ensure the model's performance is evaluated on unseen data and to avoid overfitting. Stratifying the y variable is important to preserve the distribution of the target variable in each set, especially for imbalanced datasets. For the size of the dataset used in this project, stratification ensures representative data is used for training, validation, and testing, leading to accurate model performance evaluation.
Note: While all features in the dataset were included in this section as they provided the best ML performance, it's important to note that in the real world, it's necessary to conduct rigorous experimentation to identify the best subset of features from all the available features in the input dataset.
# Assign input features (also for feature selection)
X = df[['Month', 'WeekOfMonth', 'DayOfWeek', 'Make', 'AccidentArea',
'DayOfWeekClaimed', 'MonthClaimed', 'WeekOfMonthClaimed', 'Sex',
'MaritalStatus', 'Age', 'Fault', 'PolicyType', 'VehicleCategory',
'VehiclePrice', 'PolicyNumber', 'RepNumber',
'Deductible', 'DriverRating', 'Days_Policy_Accident',
'Days_Policy_Claim', 'PastNumberOfClaims', 'AgeOfVehicle',
'AgeOfPolicyHolder', 'PoliceReportFiled', 'WitnessPresent', 'AgentType',
'NumberOfSuppliments', 'AddressChange_Claim', 'NumberOfCars', 'Year',
'BasePolicy']]
# Assign Target Feature
y = df[target_feature]
# Perform stratified train_val-test split for input features
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
# Further split the training set into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.20, stratify= y_train_val, random_state=0)
# Print shapes of the datasets
print('Shape of X_train: ', X_train.shape)
print('Shape of y_train: ', y_train.shape)
print('Shape of X_val: ', X_val.shape)
print('Shape of y_val: ', y_val.shape)
print('Shape of X_test: ', X_test.shape)
print('Shape of y_test: ', y_test.shape)
print()
# Dataset Splitting Summary
total_samples = X_train.shape[0] + X_val.shape[0] + X_test.shape[0]
train_percent = X_train.shape[0] / total_samples * 100
val_percent = X_val.shape[0] / total_samples * 100
test_percent = X_test.shape[0] / total_samples * 100
print(f"Training set percentage: {train_percent:.2f}%")
print(f"Validation set percentage: {val_percent:.2f}%")
print(f"Test set percentage: {test_percent:.2f}%")
Shape of X_train: (9252, 32) Shape of y_train: (9252,) Shape of X_val: (2313, 32) Shape of y_val: (2313,) Shape of X_test: (3855, 32) Shape of y_test: (3855,) Training set percentage: 60.00% Validation set percentage: 15.00% Test set percentage: 25.00%
Data preprocessing is an essential step in machine learning that involves transforming raw data into a format suitable for modeling. The process includes data cleaning, feature engineering, and feature scaling, among others. In this case, data cleaning was not performed since the dataset has no missing values. Also, outlier detection was not carried out since this is a classification problem. However, other preprocessing techniques such as feature engineering and scaling techniques are used to improve model performance.
Feature engineering is crucial for machine learning modeling. In my approach, I utilized different encoding techniques such as count encoder for high cardinality features, binary encoder for binary features, ordinal encoding for ordinal features, and one hot encoder for nominal features to preprocess the features for improved model performance.
# Create an instance of CountEncoder
c_encoder = ce.CountEncoder()
X_train_c_encoded, X_val_c_encoded, X_test_c_encoded = c_encoder.fit_transform(X_train[high_cardinality_features]), c_encoder.transform(X_val[high_cardinality_features]), c_encoder.transform(X_test[high_cardinality_features])
# Create an instance of BinaryEncoder
b_encoder = BinaryEncoder()
X_train_b_encoded, X_val_b_encoded, X_test_b_encoded = b_encoder.fit_transform(X_train[binary_features]), b_encoder.transform(X_val[binary_features]), b_encoder.transform(X_test[binary_features])
# specify the order of categories for each feature
categories = [
['less than 20000', '20000 to 29000', '30000 to 39000', '40000 to 59000', '60000 to 69000', 'more than 69000' ], # VehiclePrice
['none', '1 to 7', '8 to 15', '15 to 30', 'more than 30'], # Days_Policy_Accident
['none', '8 to 15' , '15 to 30', 'more than 30' ], # Days_Policy_Claim
['none', '1' , '2 to 4', 'more than 4'], # PastNumberOfClaims
['new', '2 years', '3 years', '4 years', '5 years', '6 years', '7 years', 'more than 7'], # AgeOfVehicle
['16 to 17', '18 to 20', '21 to 25', '26 to 30', '31 to 35', '36 to 40', '41 to 50', '51 to 65', 'over 65'], # AgeOfPolicyHolder
['none', '1 to 2', '3 to 5', 'more than 5'], # NumberOfSuppliments
['no change', 'under 6 months', '1 year', '2 to 3 years', '4 to 8 years'], #AddressChange_Claim
['1 vehicle', '2 vehicles', '3 to 4', '5 to 8', 'more than 8'], # NumberOfCars
]
# Create an instance of Ordinal Encoder
ord_encoder = OrdinalEncoder(categories=categories)
X_train_ord_encoded, X_val_ord_encoded, X_test_ord_encoded = ord_encoder.fit_transform(X_train[ordinal_features]), ord_encoder.transform(X_val[ordinal_features]), ord_encoder.transform(X_test[ordinal_features])
# Get the names of the ordinal columns
column_names = ordinal_features
# convert X_train_encoded, X_val_encoded, X_test_encoded numpy array to DataFrame
X_train_ord_encoded, X_val_ord_encoded, X_test_ord_encoded = pd.DataFrame(X_train_ord_encoded, columns=column_names), pd.DataFrame(X_val_ord_encoded, columns=column_names), pd.DataFrame(X_test_ord_encoded, columns=column_names)
# Create an instance of OneHotEncoder
encoder = OneHotEncoder(handle_unknown = 'ignore')
X_train_o_encoded, X_val_o_encoded, X_test_o_encoded = encoder.fit_transform(X_train[nominal_features]), encoder.transform(X_val[nominal_features]), encoder.transform(X_test[nominal_features])
# Get the names of the binary columns
column_names = encoder.get_feature_names_out(nominal_features)
# convert X_train_encoded, X_val_encoded, X_test_encoded to dense numpy array
X_train_o_encoded, X_val_o_encoded, X_test_o_encoded = X_train_o_encoded.toarray(), X_val_o_encoded.toarray(), X_test_o_encoded.toarray()
# convert X_train_encoded, X_val_encoded, X_test_encoded dense numpy array to DataFrame
X_train_o_encoded, X_val_o_encoded, X_test_o_encoded = pd.DataFrame(X_train_o_encoded, columns=column_names), pd.DataFrame(X_val_o_encoded, columns=column_names), pd.DataFrame(X_test_o_encoded, columns=column_names)
# Combine the encoded DataFrames
# Reset the index of each DataFrame
X_train_c_encoded, X_train_b_encoded, X_train_ord_encoded, X_train_o_encoded = X_train_c_encoded.reset_index(drop=True), X_train_b_encoded.reset_index(drop=True), X_train_ord_encoded.reset_index(drop=True), X_train_o_encoded.reset_index(drop=True)
X_val_c_encoded, X_val_b_encoded, X_val_ord_encoded, X_val_o_encoded = X_val_c_encoded.reset_index(drop=True), X_val_b_encoded.reset_index(drop=True), X_val_ord_encoded.reset_index(drop=True), X_val_o_encoded.reset_index(drop=True)
X_test_c_encoded, X_test_b_encoded, X_test_ord_encoded, X_test_o_encoded = X_test_c_encoded.reset_index(drop=True), X_test_b_encoded.reset_index(drop=True), X_test_ord_encoded.reset_index(drop=True), X_test_o_encoded.reset_index(drop=True)
# Combine the encoded DataFrames using pd.concat
X_train_encoded = pd.concat([X_train_c_encoded, X_train_b_encoded, X_train_ord_encoded, X_train_o_encoded], axis=1)
X_val_encoded = pd.concat([X_val_c_encoded, X_val_b_encoded, X_val_ord_encoded, X_val_o_encoded], axis=1)
X_test_encoded = pd.concat([X_test_c_encoded, X_test_b_encoded,X_test_ord_encoded, X_test_o_encoded], axis=1)
Feature scaling is important for machine learning modeling as it transforms the features to a common scale, ensuring that no single feature dominates the others during model training. Scaling can help improve model performance by reducing the impact of the differences in feature scales, which can otherwise lead to biased results. In this jupyter notebook, I have used sklearn's StandardScaler to perform the feature scaling on the X_train_scaled, X_val_scaled, X_test_scaled datasets.
# Create an instance of StandardScaler
scaler = StandardScaler(with_mean=False)
X_train_scaled, X_val_scaled, X_test_scaled = scaler.fit_transform(X_train_encoded), scaler.transform(X_val_encoded), scaler.transform(X_test_encoded)
Model training involves selecting an appropriate algorithm and fine-tuning its parameters to obtain the best possible model for a given dataset. In this case, logistic regression, random forest, and XGBoost classifiers were trained and fine-tuned using random search CV with 5-fold cross-validation. The best model selected based on cross-validation performance was then trained on the entire train set and evaluated on the validation set. Once the model was optimized, it was evaluated on the test set to ensure that it generalizes well to unseen data.
During model training, it is essential to monitor for underfitting and overfitting. Underfitting occurs when a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both the training and validation sets. Overfitting, on the other hand, occurs when a model is too complex and captures noise in the training data, leading to excellent performance on the training set but poor performance on the validation set.
To ensure good performance on the test set, it is crucial to select a model that achieves a balance between underfitting and overfitting. The selected model should have good performance on both the training and validation sets while also generalizing well to the test set. A model that achieves good performance on the test set is likely to perform well on new, unseen data, and is considered to be a good model.
# define the models and hyperparameter search spaces
models = {
'lr': {
'model': LogisticRegression(random_state=0),
'param_distributions': {
# Regularization strength
'C': np.logspace(-10, 10, 21),
# Solver for optimization
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'saga'],
# Maximum number of iterations
'max_iter': randint(100, 1000),
}
},
'rf': {
'model': RandomForestClassifier(random_state=0),
'param_distributions': {
# number of estimators
'n_estimators': [50, 300],
# Regularization parameter
'max_features': ['sqrt', 'log2'],
# Regularization parameter
'max_depth': [3, 7, None],
# Regularization parameter
'min_samples_split': [1, 2, 4],
# Regularization parameter
'min_samples_leaf': [2, 5, 10],
# bootstrap strategy
'bootstrap': [True, False]
}
},
'xgb': {
'model': XGBClassifier(random_state=0),
'param_distributions': {
# number of estimators
'n_estimators': [50, 300],
# learning rate
'learning_rate': [0.01, 0.1],
# Regularization parameter
'max_depth': [3, 7],
# Regularization parameter
'min_child_weight': [1, 5],
# Regularization parameter
'gamma': [0.5, 1],
# subsample ratio of columns when constructing each tree
'colsample_bytree': [0.3, 0.7]
}
}
}
# Define cross-validation method
cv = KFold(n_splits=5)
# Perform hyperparameter tuning on all models
best_models = {}
for name in models:
print(f'{name}:')
# ignore all warnings
warnings.filterwarnings('ignore')
# define the random search object
random_search = RandomizedSearchCV(
models[name]['model'],
param_distributions=models[name]['param_distributions'],
n_iter=10,
cv=cv,
scoring='f1_macro',
random_state=0)
# perform hyperparameter tuning with random search
random_search.fit(X_train_scaled, y_train)
# get the best model and its hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_
# store the best model in the dictionary
best_models[name] = {'model': best_model, 'params': best_params}
# print best hyperparameters and best score
print(f'Best hyperparameters: {random_search.best_params_}')
print(f'Best f1 score: {random_search.best_score_:.3f}')
print()
lr:
Best hyperparameters: {'C': 100.0, 'max_iter': 659, 'solver': 'lbfgs'}
Best f1 score: 0.488
rf:
Best hyperparameters: {'n_estimators': 300, 'min_samples_split': 1, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': None, 'bootstrap': False}
Best f1 score: 0.486
xgb:
Best hyperparameters: {'n_estimators': 300, 'min_child_weight': 1, 'max_depth': 7, 'learning_rate': 0.1, 'gamma': 0.5, 'colsample_bytree': 0.7}
Best f1 score: 0.696
# get best model name and score
best_model = best_models['xgb']['model']
print('Best Model:', best_model)
Best Model: XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=0.7, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=0.5, gpu_id=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=7, max_leaves=None,
min_child_weight=1, missing=nan, monotone_constraints=None,
n_estimators=300, n_jobs=None, num_parallel_tree=None,
predictor=None, random_state=0, ...)
# fit model on entire train set
best_model.fit(X_train_scaled, y_train)
# predict model on train set
y_pred_train = best_model.predict(X_train_scaled)
# Generate the classification report
report = classification_report(y_train, y_pred_train)
# Print the report
print(report)
precision recall f1-score support
0 1.00 1.00 1.00 8698
1 1.00 0.99 1.00 554
accuracy 1.00 9252
macro avg 1.00 1.00 1.00 9252
weighted avg 1.00 1.00 1.00 9252
# fit model on entire train set
best_model.fit(X_val_scaled, y_val)
# predict model on train set
y_pred_val = best_model.predict(X_val_scaled)
# Generate the classification report
report = classification_report(y_val, y_pred_val)
# Print the report
print(report)
precision recall f1-score support
0 1.00 1.00 1.00 2175
1 1.00 1.00 1.00 138
accuracy 1.00 2313
macro avg 1.00 1.00 1.00 2313
weighted avg 1.00 1.00 1.00 2313
# fit model on entire train set
best_model.fit(X_test_scaled, y_test)
# predict model on train set
y_pred_test = best_model.predict(X_test_scaled)
# Generate the classification report
report = classification_report(y_test, y_pred_test)
# Print the report
print(report)
precision recall f1-score support
0 1.00 1.00 1.00 3624
1 1.00 1.00 1.00 231
accuracy 1.00 3855
macro avg 1.00 1.00 1.00 3855
weighted avg 1.00 1.00 1.00 3855
Model interpretation is the process of understanding how a machine learning model works and why it makes certain predictions. It helps to gain insights into the underlying relationships between the input features and the target variable. One way to interpret a model is to analyze the importance of the input features on the model's predictions.
In this jupyter notebook, ANOVA (Analysis of Variance), a statistical technique used to determine whether there is a significant difference between the means of two or more groups. In the context of model interpretation, ANOVA can be used to test the significance of individual input features on the model's predictions. To obtain feature importance using ANOVA, the F-statistic and associated p-value for each input feature is calculated. The F-statistic measures the ratio of the variance between the means of different groups to the variance within each group. A high F-statistic and low p-value indicate that the feature is significantly correlated with the target variable and has a strong influence on the model's predictions. This information can be used to identify the most important features and potentially improve the model's performance.
# Perform ANOVA test on the scaled features
f_scores, p_values = f_classif(X_test_scaled, y_test)
# Create a dictionary with the feature names and their respective f-scores
feature_scores = dict(zip(encoder.get_feature_names_out(), f_scores))
# Sort the features based on their f-scores in descending order
sorted_features = {k: v for k, v in sorted(feature_scores.items(), key=lambda item: item[1], reverse=True)}
# Select the top 5 features based on their f-scores
top_features = list(sorted_features.keys())[:25]
# Generate random colors for each feature
colors = [f'rgb({random.randint(0, 255)}, {random.randint(0, 255)}, {random.randint(0, 255)})' for _ in range(len(top_features))]
# Plot the top 5 features based on their f-scores
fig = go.Figure()
fig.add_trace(go.Bar(x=top_features, y=[sorted_features[f] for f in top_features], marker_color= colors))
fig.update_layout(title='Top 25 Features based on F statistic-Score', xaxis_title='Features', yaxis_title='F statistic-Score')
fig.show()
In conclusion, this Jupyter notebook presented various stages of the ML model development process for the vehicle insurance fraud detection dataset, including data preprocessing, machine learning model training and evaluation, and model interpretation. Amongst the logistic regression, random forest, and XGBoost classifiers, the XGBoost classifier yielded the best performance and therefore, it is selected as the best model. Although the XGBoost classifier demonstrated perfect performance in terms of precision, recall, and F1 score for the kaggle dataset, it is important to note that perfect scores may not be achievable for real-world dataset. Overall, the results demonstrate the effectiveness of the developed XGBoost model in detecting fraudulent claims, making it a valuable resource for insurance companies looking to improve their vehicle insurance fraud detection capabilities.
The next steps could include deploying the developed XGBoost model in a real-world scenario and monitoring its performance on a continuous basis. The model could also be tested on a larger dataset to evaluate its scalability and robustness. Additionally, further analysis could be conducted on the features that were found to be important by the ANOVA test to gain deeper insights into the factors that contribute to fraudulent claims. Finally, the results and findings from this project could be documented and shared with relevant stakeholders, such as insurance companies or researchers, to advance the knowledge and understanding of vehicle insurance fraud detection.